ICLR 2026
Convergence of Muon with Newton-Schulz
We analyze Muon as originally proposed and used in practice: orthogonalizing the momentum matrix with a few Newton-Schulz steps. Prior theoretical results replace this key step with an exact SVD-based polar factor. We prove that Muon with Newton-Schulz converges to a stationary point at the same rate as the SVD-polar idealization, up to a constant factor that depends on the number $q$ of Newton-Schulz steps. We further analyze this constant factor, proving that it converges to 1 doubly exponentially in $q$ and improves with the degree of the polynomial used in Newton-Schulz to approximate the orthogonalization direction. We also prove that Muon removes the typical square-root-of-rank loss suffered by its vector-based counterpart, SGD with momentum. Our results explain why Muon with a few low-degree Newton-Schulz steps matches exact-polar (SVD) behavior at much lower wall-clock cost, and quantify how much orthogonalizing the momentum matrix via Newton-Schulz gains over the vector-based optimizer. Overall, our theory justifies the practical Newton-Schulz design of Muon, narrowing its practice-theory gap.
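As a concrete illustration, here is a minimal sketch of the orthogonalization step, using the classical cubic Newton-Schulz iteration in NumPy; practical Muon uses a tuned higher-degree polynomial, so the coefficients below are illustrative assumptions, not the implementation analyzed in the paper:

import numpy as np

def newton_schulz_polar(M, q=5):
    # Approximate the polar factor of M (the orthogonal factor U V^T of its
    # SVD) with q cubic Newton-Schulz steps. Scaling by the Frobenius norm
    # first puts every singular value in (0, 1], inside the region where
    # the iteration converges.
    X = M / (np.linalg.norm(M) + 1e-7)
    for _ in range(q):
        # Each step applies f(s) = 1.5 s - 0.5 s^3 to every singular value,
        # pushing it toward 1 (doubly exponentially once it is close to 1).
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

In Muon, a routine of this kind replaces the exact SVD polar factor of the momentum matrix; the paper's constant factor measures how far the q-step output's singular values still are from 1.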
A Unifying View of Coverage in Linear Off-Policy Evaluation
Amortila, Philip, Huang, Audrey, Krishnamurthy, Akshay, Jiang, Nan
Off-policy evaluation (OPE) is a fundamental task in reinforcement learning (RL). In the classic setting of linear OPE, finite-sample guarantees often take the form $$ \textrm{Evaluation error} \le \textrm{poly}(C^\pi, d, 1/n, \log(1/\delta)), $$ where $d$ is the dimension of the features and $C^\pi$ is a coverage parameter that characterizes the degree to which the visited features lie in the span of the data distribution. While such guarantees are well understood for several popular algorithms under stronger assumptions (e.g., Bellman completeness), the understanding is lacking and fragmented in the minimal setting where only the target value function is linearly realizable in the features. Despite recent interest in tight characterizations of the statistical rate in this setting, the right notion of coverage remains unclear, and candidate definitions from prior analyses have undesirable properties and are starkly disconnected from more standard definitions in the literature. We provide a novel finite-sample analysis of a canonical algorithm for this setting, LSTDQ. Inspired by an instrumental-variable view, we develop error bounds that depend on a novel coverage parameter, the feature-dynamics coverage, which can be interpreted as linear coverage in an induced dynamical system for feature evolution. Under further assumptions, such as Bellman completeness, our definition recovers the coverage parameters specialized to those settings, finally yielding a unified understanding of coverage in linear OPE.
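Since LSTDQ is the canonical algorithm under analysis, a minimal sketch may help fix ideas; the variable names and the ridge term are illustrative assumptions:

import numpy as np

def lstdq(phi, phi_next, rewards, gamma=0.99, reg=1e-6):
    # phi:      (n, d) features of sampled state-action pairs (s, a)
    # phi_next: (n, d) features of (s', pi(s')) under the target policy
    # rewards:  (n,)   observed rewards
    # Solve A w = b with A = Phi^T (Phi - gamma Phi'), the asymmetric
    # moment matrix behind the instrumental-variable view.
    A = phi.T @ (phi - gamma * phi_next)
    b = phi.T @ rewards
    w = np.linalg.solve(A + reg * np.eye(phi.shape[1]), b)
    return w  # estimated values: Q_hat(s, a) = phi(s, a) @ w

Loosely speaking, coverage parameters like the feature-dynamics coverage can be read as controlling how well-conditioned the population version of A is, and hence how the error of w scales with the sample size n.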
Direct Doubly Robust Estimation of Conditional Quantile Contrasts
Givens, Josh, Liu, Song, Reeve, Henry W J, Reluga, Katarzyna
Within heterogeneous treatment effect (HTE) analysis, various estimands have been proposed to capture the effect of a treatment conditional on covariates. Recently, the conditional quantile comparator (CQC) has emerged as a promising estimand, offering quantile-level summaries akin to the conditional quantile treatment effect (CQTE) while preserving some of the interpretability of the conditional average treatment effect (CATE). It achieves this by summarising the treated response conditional on both the covariates and the untreated response. Despite these desirable properties, current estimation of the CQC is limited by the need to first estimate the difference in conditional cumulative distribution functions and then invert it. This inversion obscures the CQC estimate, hampering our ability to both model and interpret it. To address this, we propose the first direct estimator of the CQC, allowing for explicit modelling and parameterisation. This explicit parameterisation enables better interpretation of our estimate while also providing a means to constrain and inform the model. We show, both theoretically and empirically, that our estimation error depends directly on the complexity of the CQC itself, improving upon the existing estimation procedure. Furthermore, it retains the desirable double robustness property with respect to nuisance parameter estimation. We further show that our method outperforms existing procedures in estimation accuracy across multiple data scenarios while varying sample size and nuisance error. Finally, we apply it to real-world data from an employment scheme, uncovering a reduced range of potential earnings improvement as participant age increases.
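At the core of any direct quantile estimator is the check (pinball) loss; the sketch below shows it in its plain form, with the paper's doubly robust weighting and nuisance corrections deliberately omitted:

import numpy as np

def pinball_loss(y, pred, tau):
    # Check loss for the tau-th quantile: minimizing it over a model class
    # yields a direct estimate of the conditional quantile, avoiding the
    # estimate-then-invert route through conditional CDFs.
    r = y - pred
    return np.mean(np.maximum(tau * r, (tau - 1.0) * r))

A direct CQC estimator in this spirit would regress the treated response on both the covariates and the untreated response under such a loss, which is what makes the resulting model explicit and parameterisable.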
SoftStep: Learning Sparse Similarity Powers Deep Neighbor-Based Regression
Susman, Aviad, Lin, Baihan, Suárez-Fariñas, Mayte, Colonel, Joseph T
Neighbor-based methods are a natural alternative to linear prediction for tabular data when relationships between inputs and targets exhibit complexity such as nonlinearity, periodicity, or heteroscedasticity. Yet in deep learning on unstructured data, nonparametric neighbor-based approaches are rarely used in place of simple linear heads. This is primarily due to the ability of systems equipped with linear regression heads to co-learn internal representations along with the linear head's parameters. To unlock the full potential of neighbor-based methods in neural networks, we introduce SoftStep, a parametric module that learns sparse instance-wise similarity measures directly from data. When integrated with existing neighbor-based methods, SoftStep enables regression models that consistently outperform linear heads across diverse architectures, domains, and training scenarios. We focus on regression tasks, where we show theoretically that neighbor-based prediction with a mean squared error objective constitutes a metric learning algorithm that induces well-structured embedding spaces. We then demonstrate analytically and empirically that this representational structure translates into superior performance when combined with the sparse, instance-wise similarity measures introduced by SoftStep. Beyond regression, SoftStep is a general method for learning instance-wise similarity in deep neural networks, with broad applicability to attention mechanisms, metric learning, representational alignment, and related paradigms.
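The abstract does not spell out SoftStep's similarity module, so the sketch below uses a plain softmax kernel over embedding distances as a hedged stand-in for a differentiable neighbor-based regression head:

import torch

def soft_neighbor_regress(z_query, z_train, y_train, tau=0.1):
    # Predict each query target as a similarity-weighted average of
    # training targets in the learned embedding space; gradients flow
    # into the encoder producing z_query and z_train, which is what lets
    # representations be co-learned with the head.
    d = torch.cdist(z_query, z_train)        # (q, n) pairwise distances
    w = torch.softmax(-d / tau, dim=-1)      # dense similarity weights
    return w @ y_train                       # (q,) neighbor-weighted mean

SoftStep's contribution, per the abstract, is to replace this dense kernel with a learned sparse, instance-wise similarity measure.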
AbstRaL: Augmenting LLMs' Reasoning by Reinforcing Abstract Thinking
Gao, Silin, Bosselut, Antoine, Bengio, Samy, Abbe, Emmanuel
Recent studies have shown that large language models (LLMs), especially smaller ones, often lack robustness in grade school math (GSM) reasoning. In particular, they tend to experience performance drops when faced with distribution shifts, such as changes to numerical or nominal variables, or insertions of distracting clauses. A possible strategy to address this involves generating synthetic data to further "instantiate" reasoning problems on potential variations. In this work, we instead focus on the strategy of "abstracting" reasoning problems. This not only helps counteract distribution shifts but also facilitates the connection to symbolic tools for deriving solutions. Focusing on GSM, we find that this abstraction process is better acquired through reinforcement learning (RL) than through supervised fine-tuning alone, which often fails to produce faithful abstractions. Our method, AbstRaL -- which promotes abstract reasoning in LLMs using RL on granular abstraction data -- significantly mitigates performance degradation on recent GSM perturbation benchmarks. Moreover, improving GSM robustness via AbstRaL also implicitly benefits LLMs' capabilities on OOD mathematical and general reasoning tasks, indicating that abstract thinking broadly enables better generalizability.
E$^3$-Pruner: Towards Efficient, Economical, and Effective Layer Pruning for Large Language Models
Yuan, Tao, Bai, Haoli, Pan, Yinfei, Cao, Xuyang, Zhang, Tianyu, Hou, Lu, Hu, Ting, Yu, Xianzhi
With the increasing size of large language models, layer pruning has gained increasing attention as a hardware-friendly approach to model compression. However, existing layer pruning methods struggle to simultaneously address key practical deployment challenges, including performance degradation, high training costs, and limited acceleration. To overcome these limitations, we propose E$^3$-Pruner, a task-Effective, training-Economical, and inference-Efficient layer pruning framework. E$^3$-Pruner introduces two key innovations: (1) a differentiable mask optimization method using a Gumbel-TopK sampler, enabling efficient and precise pruning mask search; and (2) an entropy-aware adaptive knowledge distillation strategy that enhances task performance. Extensive experiments over diverse model architectures and benchmarks demonstrate the superiority of our method over state-of-the-art approaches. Notably, E$^3$-Pruner achieves 96% accuracy on MATH-500, a mere 0.8% drop from the original model (96.8%), when pruning 25% of the layers of Qwen3-32B, outperforming the existing SOTA (95%), with a 1.33$\times$ inference speedup while consuming merely 0.5B tokens (0.5% of the post-training data volume).
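A common way to realize a differentiable Gumbel-TopK mask is a straight-through relaxation; the sketch below is an assumption about the general construction, not E$^3$-Pruner's exact sampler:

import torch

def gumbel_topk_mask(scores, k, tau=1.0):
    # Perturb learnable per-layer scores with Gumbel noise, keep the top-k
    # layers in the forward pass, and route gradients through the softmax
    # relaxation (straight-through estimator).
    u = torch.rand_like(scores).clamp(1e-9, 1 - 1e-9)
    g = -torch.log(-torch.log(u))                      # Gumbel(0, 1) noise
    y_soft = torch.softmax((scores + g) / tau, dim=-1)
    idx = torch.topk(y_soft, k).indices
    y_hard = torch.zeros_like(y_soft).scatter(-1, idx, 1.0)
    return y_hard + y_soft - y_soft.detach()           # forward: hard mask

Multiplying layer outputs by such a mask lets the pruning pattern be searched by gradient descent while the forward pass stays discrete.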
Mind Your Entropy: From Maximum Entropy to Trajectory Entropy-Constrained RL
Zhan, Guojian, Wang, Likun, Wang, Pengcheng, Zhang, Feihong, Duan, Jingliang, Tomizuka, Masayoshi, Li, Shengbo Eben
Maximum entropy has become a mainstream off-policy reinforcement learning (RL) framework for balancing exploitation and exploration. However, two bottlenecks still limit further performance improvement: (1) non-stationary Q-value estimation caused by jointly injecting entropy and updating its weighting parameter, i.e., the temperature; and (2) short-sighted local entropy tuning that adjusts the temperature only according to the current single-step entropy, without considering the effect of cumulative entropy over time. In this paper, we extend the maximum entropy framework by proposing a trajectory entropy-constrained reinforcement learning (TECRL) framework to address these two challenges. Within this framework, we first learn two separate Q-functions, one associated with reward and the other with entropy, ensuring clean and stable value targets unaffected by temperature updates. The dedicated entropy Q-function, which explicitly quantifies the expected cumulative entropy, then enables us to enforce a trajectory entropy constraint and consequently control the policy's long-term stochasticity. Building on this TECRL framework, we develop a practical off-policy algorithm, DSAC-E, by extending the state-of-the-art distributional soft actor-critic with three refinements (DSAC-T). Empirical results on the OpenAI Gym benchmark demonstrate that DSAC-E achieves higher returns and better stability.
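The key object is the second critic; a hedged sketch of its TD target follows (the exact bootstrapping convention in DSAC-E may differ, so this is an assumption for illustration):

def entropy_q_target(q_h_next, logp_next, gamma=0.99):
    # TD target for a dedicated entropy critic Q_H that accumulates the
    # discounted policy entropy (-log pi) along the trajectory. Because the
    # reward critic bootstraps separately, its targets stay unaffected by
    # temperature updates, and a trajectory entropy constraint can be
    # enforced on Q_H directly.
    return gamma * (-logp_next + q_h_next)

Under this reading, the natural dual update adjusts the temperature to keep the estimated trajectory entropy Q_H at the constraint level, rather than matching a per-step entropy target.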
Thought Branches: Interpreting LLM Reasoning Requires Resampling
Macar, Uzay, Bogdan, Paul C., Rajamanoharan, Senthooran, Nanda, Neel
Most work interpreting reasoning models studies only a single chain-of-thought (CoT), yet these models define distributions over many possible CoTs. We argue that studying a single sample is inadequate for understanding causal influence and the underlying computation. Though fully specifying this distribution is intractable, it can be understood by sampling. We present case studies using resampling to investigate model decisions. First, when a model states a reason for its action, does that reason actually cause the action? In "agentic misalignment" scenarios, we resample specific sentences to measure their downstream effects. Self-preservation sentences have small causal impact, suggesting they do not meaningfully drive blackmail. Second, are artificial edits to the CoT sufficient for steering reasoning? These are common in the literature, yet they take the model off-policy. Resampling and selecting a completion with the desired property is a principled on-policy alternative. We find that off-policy interventions yield small and unstable effects compared to resampling in decision-making tasks. Third, how do we understand the effect of removing a reasoning step when the model may repeat it post-edit? We introduce a resilience metric that repeatedly resamples to prevent similar content from reappearing downstream. Critical planning statements resist removal but have large effects when eliminated. Fourth, since the CoT is sometimes "unfaithful", can our methods teach us anything in these settings? Adapting causal mediation analysis, we find that hints which causally affect the output without being explicitly mentioned exert a subtle, cumulative influence on the CoT that persists even if the hint is removed. Overall, studying distributions via resampling enables reliable causal analysis, clearer narratives of model reasoning, and principled CoT interventions.
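The first case study reduces to a simple resampling loop; the callables below are hypothetical stand-ins for model sampling, not the paper's API:

def resampling_effect(rollout_action, resample_sentence, prefix, sentence,
                      target, n=50):
    # Compare how often the final action equals `target` when `sentence`
    # is kept versus when it is replaced by an on-policy resample drawn
    # from the same CoT prefix.
    kept = sum(rollout_action(prefix + sentence) == target
               for _ in range(n)) / n
    swapped = sum(rollout_action(prefix + resample_sentence(prefix)) == target
                  for _ in range(n)) / n
    return kept - swapped  # shift in target-action rate due to the sentence

A near-zero difference is what the paper reports for self-preservation sentences in the blackmail scenarios: the stated reason does not causally drive the action.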
Functional embeddings enable aggregation of multi-area SEEG recordings over subjects and sessions
Javadzadeh, Sina, Soroushmojdehi, Rahil, Mousavi, S. Alireza Seyyed, Asadi, Mehrnaz, Abe, Sumiko, Sanger, Terence D.
Aggregating intracranial recordings across subjects is challenging because electrode count, placement, and covered regions vary widely. Spatial normalization methods such as MNI coordinates offer a shared anatomical reference, but often fail to capture true functional similarity, particularly when localization is imprecise; even at matched anatomical coordinates, the targeted brain region and underlying neural dynamics can differ substantially between individuals. We propose a scalable representation-learning framework that (i) learns a subject-agnostic functional identity for each electrode from multi-region local field potentials using a Siamese encoder with contrastive objectives, inducing an embedding geometry that is locality-sensitive to region-specific neural signatures, and (ii) tokenizes these embeddings for a transformer that models inter-regional relationships with a variable number of channels. We evaluate this framework on a 20-subject dataset spanning basal ganglia-thalamic regions collected during flexible rest/movement recording sessions with heterogeneous electrode layouts. The learned functional space supports accurate within-subject discrimination and forms clear, region-consistent clusters; it transfers zero-shot to unseen channels. The transformer, operating on functional tokens without subject-specific heads or supervision, captures cross-region dependencies and enables reconstruction of masked channels, providing a subject-agnostic backbone for downstream decoding. Together, these results indicate a path toward large-scale, cross-subject aggregation and pretraining for intracranial neural data where strict task structure and uniform sensor placement are unavailable.

Building models that generalize across subjects in neuroscience requires representations that remain stable despite variability in data acquisition. Intracranial neural recordings lack this stability: electrode locations, counts, sampling, and coverage differ across individuals, reflecting clinical needs rather than standardized layouts. Without a shared representational system, cross-subject aggregation is unreliable, limiting scalable modeling and clinical translation. Such recordings are uniquely valuable for studying inter-regional communication, yet their heterogeneity makes them especially challenging to align. In practice, two obstacles dominate: anatomical variability and inconsistent electrode coverage.
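A minimal sketch of the contrastive objective behind such a subject-agnostic electrode embedding, using a standard InfoNCE loss as a stand-in for the paper's exact Siamese objective:

import torch
import torch.nn.functional as F

def electrode_info_nce(z_a, z_b, tau=0.07):
    # z_a, z_b: (n, d) embeddings of two views (e.g., different time
    # windows) of the same n electrodes. Matching rows are positives;
    # all other rows in the batch are negatives, which pushes channels
    # with similar regional dynamics together regardless of subject.
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.T / tau                          # (n, n) similarities
    labels = torch.arange(z_a.shape[0], device=z_a.device)
    return F.cross_entropy(logits, labels)

The resulting per-electrode embeddings are then tokenized for the transformer, which can attend over a variable number of channels without subject-specific heads.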
Multi-Representation Attention Framework for Underwater Bioacoustic Denoising and Recognition
Razig, Amine, Soulaymani, Youssef, Benabbou, Loubna, Cauchy, Pierre
Automated monitoring of marine mammals in the St. Lawrence Estuary faces extreme challenges: calls span low-frequency moans to ultrasonic clicks, often overlap, and are embedded in variable anthropogenic and environmental noise. We introduce a multi-step, attention-guided framework that first segments spectrograms to generate soft masks of biologically relevant energy and then fuses these masks with the raw inputs for multi-band, denoised classification. Image and mask embeddings are integrated via mid-level fusion, enabling the model to focus on salient spectrogram regions while preserving global context. Using real-world recordings from the Saguenay-St. Lawrence Marine Park Research Station in Canada, we demonstrate that segmentation-driven attention and mid-level fusion improve signal discrimination, reduce false positive detections, and produce reliable representations for operational marine mammal monitoring across diverse environmental conditions and signal-to-noise ratios. Beyond in-distribution evaluation, we further assess the generalization of Mask-Guided Classification (MGC) under distributional shifts by testing on spectrograms generated with alternative acoustic transformations. While high-capacity baseline models lose accuracy in this out-of-distribution (OOD) setting, MGC maintains stable performance, with even simple fusion mechanisms (gated, concat) achieving comparable results across distributions. This robustness highlights the capacity of MGC to learn transferable representations rather than overfitting to a specific transformation, reinforcing its suitability for large-scale, real-world biodiversity monitoring. Across all experimental settings, the MGC framework consistently outperforms baseline architectures, yielding substantial accuracy gains on both in-distribution and OOD data.
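A hedged sketch of the mid-level fusion at the heart of MGC; the encoders and the concatenation-based fusion are assumptions about one of the 'concat' variants, not the paper's full architecture:

import torch
import torch.nn as nn

class MidFusionClassifier(nn.Module):
    # Embed the raw spectrogram and the segmentation model's soft mask
    # separately, concatenate the two embeddings, and classify; the mask
    # branch steers the model toward biologically relevant energy while
    # the image branch preserves global context.
    def __init__(self, enc_img, enc_mask, dim, n_classes):
        super().__init__()
        self.enc_img, self.enc_mask = enc_img, enc_mask
        self.head = nn.Linear(2 * dim, n_classes)

    def forward(self, spec, soft_mask):
        h = torch.cat([self.enc_img(spec), self.enc_mask(soft_mask)], dim=-1)
        return self.head(h)

Here enc_img and enc_mask are any encoders mapping their inputs to dim-dimensional vectors; the gated variant mentioned in the abstract would replace the concatenation with a learned elementwise gate.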